Abstract

The concept of matching dependencies (MDs) was recently proposed for
specifying matching rules for object identification. Similar to functional
dependencies (with conditions), MDs can also be applied to various
data quality applications such as violation detection. In this paper, we
study the problem of discovering matching dependencies from a given
database instance. First, we formally define the measures, support and
confidence, for evaluating the utility of MDs in the given database instance.
Then, we study the discovery of MDs with certain utility requirements of
support and confidence. Exact algorithms are developed, together with
pruning strategies to improve the time performance. Since the exact
algorithm has to traverse all the data during the computation, we propose
an approximate solution which uses only some of the data. A bound on the
relative errors introduced by the approximation is also developed. Finally,
our experimental evaluation demonstrates the efficiency of the proposed
methods.



1 Introduction 

Recently, data quality has become a hot topic in the database community due to
the huge amount of "dirty" data originating from different sources (see [3] for a
survey). These data often contain inconsistencies, conflicts, and
errors introduced by humans and machines. In addition
to the cost of dealing with the huge volume of data, manually detecting and removing
"dirty" data is impractical because human-proposed cleaning methods
may introduce inconsistencies again. Therefore, data dependencies, which
have been widely used in relational database design to set up integrity
constraints, have been revisited and revised to capture a wider range of
inconsistencies in the data. For example, consider a Contacts relation with the schema:

Contacts(SIN, Name, CC, ZIP, City, Street) 

The functional dependency fd given below specifies the constraint that for any two
tuples in Contacts, if they have the same ZIP code, then they have
the same City as well. Recently, functional dependencies (FDs) have been
extended to conditional functional dependencies (CFDs) [5], i.e., FDs with
conditions, which have more expressive power. The basic idea of these extensions
is to make FDs, which originally hold for the whole table, valid only for a set of
tuples. For example, the cfd given below specifies that only under the condition of
country code CC = 44, if two tuples have the same ZIP, then they must have
the same Street as well.

fd : [ZIP] -> [City]
cfd : [ZIP, CC = 44] -> [Street]

These dependency constraints can be used to detect data violations [11]. For
instance, we can use the above fd to detect violations in the instance of Contacts
in Table 1. The tuples $t_5$ and $t_6$ have the same value ZIP = 021 but
different values of City, so they are detected as violations of the above
fd.

Although functional dependencies (and their extensions with conditions) are
very useful in determining data inconsistency and repairing "dirty" data [11],
they check the specified attribute value agreement based on exact match. For
example, with the above cfd, tuples that have CC = 44 and the same value
on the ZIP attribute will be checked to see whether they have exactly matching
values on Street. Obviously, this strict exact-match constraint limits the usage of
FDs and CFDs, since real-world information often has various representation
formats. For example, the tuples $t_2$ and $t_3$ in the Contacts table will be detected as
"violations" of the cfd, since they have "different" Street values but agree on ZIP
and CC = 44. However, "No.2, Central Rd." and "#2, Central Rd." are exactly
the "same" street in the real world, written in different representation formats.

To make dependencies adapt to this real-world scenario, i.e., to be tolerant
of various representation formats, Fan [13] proposed a new concept of data
dependencies, called matching dependencies (MDs). Informally, a matching
dependency targets fuzzy values such as text attributes and defines the dependency
between two sets of attributes according to their matching quality, measured by
some matching operators (see [4] for a survey) such as Euclidean distance and
cosine similarity. Again, in the Contacts example, we may have an MD

md1 : ([Street] -> [City], < 0.8, 0.7 >)

which states that for any two tuples from Contacts, if they agree on attribute 
Street (the matching similarity, e.g. cosine similarity, on the attribute Street 
is greater than a threshold 0.8), then the corresponding City attribute should 
match as well (i.e. similarity on City is greater than the corresponding threshold 
0.7). 

Similar to FD-related techniques, MDs can be applied in many tasks
as well [13]. For example, in data cleaning, we can also use MDs to detect
inconsistent data, that is, data that do not follow the constraint (rule) specified by
MDs. For example, according to the above MD, any two tuples $t_i$
and $t_j$ having similarity greater than 0.8 on Street should match on
City as well (similarity > 0.7). If their City similarity is less than 0.7, then there
must be something wrong in $t_i$ and $t_j$, i.e., an inconsistency. Such inconsistency on
text attributes cannot be detected by using FDs and extensions based on exact
matching.

Table 1: Example of Contacts relation $\mathcal{R}$

      SIN  Name          CC  ZIP  City     Street
t1    584  Claire Green  44  606  Chicago  No.2, Central Rd.
t2    584  Claire Greem  44  606  Chicago  No.2, Central Rd.
t3    584  Claire Gree_  44  606  Chicago  #2, Central Rd.
t4    265  Jason Smith   01  021  Boston   No.3, Central Rd.
t5    265  J. Smith      01  021  Boston   #3, Central Rd.
t6    939  W. J. Smith   01  021  Chicago  #3, Central Rd.

In addition to locating the inconsistent data, object identification,
another important work for data cleaning, can also employ MDs as matching 
rules [15]. For instance, according to 

md2 : ([Name, Street] -> [SIN], < 0.9, 0.9, 1.0 >)

if two tuples have high similarities on Name and Street (both similarities are 
greater than 0.9), then these two tuples probably denote the same person in the 
real world, i.e., having the same SIN. 

Though the concept of matching dependencies is given in [13], the authors
did not discuss how to discover useful MDs. In fact, given a database instance,
an enormous number of MDs can be discovered if we set different similarity
thresholds on attributes. Note that if all thresholds are set to 1.0, MDs have the
same semantics as traditional FDs; in other words, traditional FDs are special
cases of MDs. For instance, the above fd can be represented by an MD ([ZIP] ->
[City], < 1.0, 1.0 >). Clearly, not all the settings of thresholds for MDs are useful.

The utility of MDs in the above applications is often evaluated by confidence
and support. Specifically, we consider an MD of a relation $\mathcal{R}$, denoted by
$\varphi(X \to Y, \lambda)$, where $X$ and $Y$ are attribute sets of $\mathcal{R}$ and $\lambda$ is a pattern specifying
the similarity thresholds on each attribute in $X$ and $Y$. Let $\lambda_X$ and $\lambda_Y$ be
the projections of the thresholds in pattern $\lambda$ on the attributes $X$ and $Y$ respectively.
The support of $\varphi$ is the proportion of tuple pairs whose matching similarities
satisfy the thresholds in $\lambda$ on the attributes of both $X$ and $Y$. The confidence
is the fraction of tuple pairs satisfying $\lambda_X$ that also satisfy
$\lambda_Y$. In real applications like inconsistency detection, in order to achieve high
detection accuracy, we would like to use MDs with high confidence. On the
other hand, if users need high recall of detection, then MDs with high support
are preferred. Intuitively, we would like to discover those MDs with high support,
high confidence and high matching quality. Therefore, in this work, we would
like to discover proper settings of matching similarity thresholds for MDs, which
can satisfy users' utility requirements of support and confidence.

Contributions In this paper, given a relation instance and $X \to Y$, we study
the issues of discovering matching dependencies on the given $X \to Y$. Our main
contributions are summarized as follows:

First, we propose the utility evaluation of matching dependencies. Specifically,
the confidence and support evaluations of MDs are formally defined. To
the best of our knowledge, this is the first paper to study the utility evaluation
and discovery of MDs.

Table 2: Notations

Symbol           Description
$\varphi$        Matching dependency, MD
$\lambda$        Threshold pattern, of matching similarity
$\mathcal{C}_t$  Candidate set, of total $c$ threshold patterns
$\eta_s$         Minimum requirement, of support
$\eta_c$         Minimum requirement, of confidence
$\mathcal{R}$    Original relation, of $N$ data tuples $t$
$\mathcal{D}$    Statistical distribution, of $n$ statistical tuples $s$

Second, we study the exact algorithms for discovering MDs. The MD discovery
problem is to find settings of matching similarity thresholds on attributes
X and Y for MDs that can satisfy the required confidence and support. We first
present an exact solution and then study pruning strategies by the minimum
requirements of support and confidence.

Third, we study the approximation algorithms for discovering MDs. Since
the exact algorithm has to traverse all the data during the computation, we
propose an approximate solution which uses only some of the data. A bound
on the relative errors introduced by the approximation is developed. Moreover, we
also develop a strategy of early termination in individual steps.

Finally, we report an extensive experimental evaluation. The proposed
algorithms for discovering MDs are studied. Our pruning strategies can significantly
improve the efficiency in discovering MDs.

The remainder of this paper is organized as follows. First, we introduce
some related work in Section 2. Then, Section 3 presents the utility measures
for MDs, including support and confidence. In Section 4, we develop the exact
algorithm for discovering MDs and study the corresponding pruning strategies.
In Section 5, we present the approximation algorithm with bounded relative
errors. In Section 6, we report our extensive experimental evaluation. Finally,
Section 7 concludes this paper. Table 2 lists the frequently used notations in
this paper.

2 Related Work 

Traditional dependencies, such as functional dependencies (FDs) and inclusion
dependencies (INDs) for schema design [1], are revisited for new applications
like improving the quality of data. Conditional functional dependencies
(CFDs) were first proposed in [5] for data cleaning. Cong et al. [11] study
methods for detecting and repairing violations of CFDs. Fan et al. [16]
investigate the propagation of CFDs for data integration. Bravo et al. [6] propose
an extension of CFDs by employing disjunction and negation. Golab et al. [17]
define a range tableau for CFDs, where each value is a range, similar to the
concept of matching similarity intervals in our study. In addition, Bravo et al. [7]
propose conditional inclusion dependencies (CINDs), which are useful not only in
data cleaning but also in contextual schema matching. Ilyas et al. [20] study
a novel soft FD, which is a generalization of the classical notion of a hard
FD where the value of X completely determines the value of Y. In a soft FD,
the value of X determines the value of Y not with certainty, but merely with
high probability.

The confidence and support measures are widely used in discovering
approximate functional dependencies [19, 21] and evaluating CFDs [17, 9, 14]. The
confidence can be interpreted as an estimate of the probability that a randomly
drawn pair of tuples agreeing on X also agree on Y [22, 8]. Scheffer [27] studies
the trade-off between support and confidence for finding association rules [2], by
computing an expected prediction accuracy. In addition, Chiang and Miller [9]
also study some other measures such as conviction and the $\chi^2$-test for evaluating
dependency rules. When a candidate $X \to Y$ is suggested together with
minimum support and confidence, Golab et al. [17] study the discovery of optimal
CFDs with the minimum pattern tableau size. A concise set of patterns is
naturally desirable, since it may have lower cost in applications such as
violation detection by CFDs. On the other hand, Chiang and Miller [9] explore
CFDs by considering all the possible dependency candidates when $X \to Y$ is not
specified. In [14], Fan et al. also study the case when the embedded FDs are
not given, and propose three algorithms for different scenarios.

The concept of matching dependencies (MDs) was first proposed in [13] for
specifying matching rules for object identification (see [12] for a survey).
MDs can be regarded as a generalization of FDs, which are based on
identical values having matching similarity exactly equal to 1.0. Thus, FDs can be
represented in the syntax of MDs as well. For any two tuples, if their X values
are identical (with similarity threshold 1.0), then an FD ($X \to Y$) requires that
their Y values are identical too, i.e., an MD ($X \to Y$, < 1.0, 1.0 >). Koudas et
al. [23] also study dependencies with matching similarities on attributes Y
when given exactly matched values on X, which can be treated as a special
case of MDs. The reasoning mechanism for deducing MDs from a set of given
MDs is studied in [15]. MDs and their reasoning techniques can improve both
the quality and efficiency of various record matching methods.

3 Utility Measures 

In this section, we formally introduce the definitions of MDs. Then, we develop 
utility measures for evaluating MDs over a given database instance. 

Traditional functional dependencies (FDs) and their extensions rely on the
exact matching operator = to identify dependency relationships. However, in
real-world applications, it is not possible to use the exact matching operator =
to identify matches over fuzzy data values such as text values. For instance,
Jason Smith and J. Smith of attribute Name may refer to the same real-world
entity. Therefore, instead of FDs on identical values, matching dependencies
(MDs) [13] are proposed based on the matching quality. For text values, we can
adopt similarity matching operators, denoted by $\approx$, such as edit distance [26],
cosine similarity with word tokens [10] or q-grams [18].

Consider a relation $\mathcal{R}(A_1, \dots, A_M)$ with $M$ attributes. Following a syntax
similar to that of FDs, we define MDs as follows (see footnote 1):

Definition 1. A matching dependency (MD) $\varphi$ is a pair $(X \to Y, \lambda)$, where
$X \subseteq \mathcal{R}$, $Y \subseteq \mathcal{R}$, and $\lambda$ is a threshold pattern of matching similarity thresholds
on attributes in $X \cup Y$, e.g., $\lambda[A]$ denotes the matching similarity threshold on
attribute $A$.

An MD $\varphi$ specifies a constraint from the set of attributes $X$ to $Y$. Specifically,
the constraint states that, for any two tuples $t_1$ and $t_2$ in a relation instance
$r$ of $\mathcal{R}$, if $\bigwedge_{A_i \in X} t_1[A_i] \approx_{\lambda[A_i]} t_2[A_i]$, then
$\bigwedge_{A_j \in Y} t_1[A_j] \approx_{\lambda[A_j]} t_2[A_j]$, where
$\lambda[A_i]$ and $\lambda[A_j]$ are the matching similarity thresholds on the attributes $A_i$
and $A_j$ respectively. In the above constraint, for each attribute $A \in X \cup
Y$, the similarity matching operator $\approx$ indicates true if the similarity between
$t_1[A]$ and $t_2[A]$ satisfies the corresponding threshold $\lambda[A]$. For example, an MD
$\varphi([\text{Street}] \to [\text{City}], \langle 0.8, 0.7 \rangle)$ in the Contacts relation denotes that if two
tuples have similar Street values (with matching similarity at least 0.8) then their
City values are probably similar as well (with similarity at least 0.7).
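
The constraint above can be illustrated with a minimal Python sketch. It assumes token-based cosine similarity as the matching operator and represents tuples as dictionaries; the helper names `cosine_similarity` and `violates_md` are hypothetical and are not taken from the paper or from [13].

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over lower-cased whitespace tokens (an assumed operator)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values()) * sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def violates_md(t1, t2, lhs, rhs, thresholds, sim=cosine_similarity):
    """True if the pair matches on every attribute in lhs but not on every attribute in rhs."""
    def matches(attrs):
        return all(sim(t1[a], t2[a]) >= thresholds[a] for a in attrs)
    return matches(lhs) and not matches(rhs)


# Tuples t5 and t6 of Table 1, restricted to the attributes of the example MD.
t5 = {"Street": "#3, Central Rd.", "City": "Boston"}
t6 = {"Street": "#3, Central Rd.", "City": "Chicago"}
md_thresholds = {"Street": 0.8, "City": 0.7}
print(violates_md(t5, t6, ["Street"], ["City"], md_thresholds))  # True
```

Here the two Street values are identical while the City values share no token, so the pair is reported as a violation of the example MD.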

Like FDs and CFDs [17, 9], we adopt support and confidence measures to
evaluate matching dependencies. According to the above constraint of MDs,
we need to consider the matching quality (e.g., cosine similarity or edit distance)
of any pair of tuples $t_1$ and $t_2$ of $\mathcal{R}$. Therefore, we compute a statistical
distribution (denoted by $\mathcal{D}$) of the quality of pairwise tuple matching for
$\mathcal{R}$. The statistical distribution has a schema $\mathcal{D}(A_1, \dots, A_M, P)$, where each
attribute $A_i$ in $\mathcal{D}$ corresponds to the matching quality values on the attribute
$A_i$ of $\mathcal{R}$, and $P$ is the statistical value. Let $s$ be a statistical tuple in $\mathcal{D}$. The
statistic $s[P]$ denotes the probability that any two tuples $t_1$ and $t_2$ of $\mathcal{R}$ have
the matching quality values $s[A]$, $\forall A \in \mathcal{R}$. With a pairwise evaluation
of matching quality of all the $N$ tuples of $\mathcal{R}$, we can easily compute $P$ by

$$ s[P] = \frac{count(s)}{N(N-1)/2}, $$

where $count(s)$ records the number of pairs of tuples having matching quality
$s$. Different matching operators have different spaces of matching values, such as
cosine similarity in [0.0, 1.0] and edit distance with 1, 2, ... edit operations.
In order to evaluate in a consistent environment, we map these matching quality
values $s[A]$ to a unified space, say $[0, d-1]$, which is represented by $dom(A)$ with
$d$ elements. Table 3 shows an example of the statistical distribution $\mathcal{D}$ computed
from Contacts in Table 1 by mapping (see footnote 2) the cosine similarities in [0.0, 1.0] to
elements in $[0, d-1]$ of $dom(A)$ with $d = 10$. According to $dom(A)$ in our
example, the first tuple (1, 0, 3, ..., 0.065) denotes that about 6.5% of the
pairs in all pairwise tuple matchings have similarities 1, 0, 3, ...
on the attributes $A_1, A_2, A_3, \dots$ respectively.

Footnote 1: The MD syntax is described with two relation schemas $R_1$, $R_2$ for object
identification in [13]; it can also be represented in a single relation schema $R$, as for FDs.
Footnote 2: E.g., the cosine similarity value $s$ times $d - 1$.

Table 3: Example of statistical distribution $\mathcal{D}$

A_1  A_2  A_3  A_4  A_5  A_6  P
1    0    3    5    8    4    0.065
7    4    0    0    4    1    0.043
0    4    8    1    6    2    0.124
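
As a rough illustration of how such a distribution could be computed, the following Python sketch enumerates all unordered tuple pairs and counts their level vectors. The truncating mapping `int(sim * (d - 1))`, the token-set Jaccard operator, and the names `jaccard` and `statistical_distribution` are assumptions made for this example; the paper only states that similarities are mapped to $[0, d-1]$.

```python
from collections import Counter
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, a stand-in for any operator with values in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def statistical_distribution(tuples, attributes, sim=jaccard, d=10):
    """Map every unordered tuple pair to a vector of levels in [0, d-1] and
    return {level vector: count / (N(N-1)/2)}."""
    counts = Counter()
    for t1, t2 in combinations(tuples, 2):
        # Assumed mapping: similarity times (d - 1), truncated to an integer level.
        levels = tuple(int(sim(t1[a], t2[a]) * (d - 1)) for a in attributes)
        counts[levels] += 1
    total_pairs = len(tuples) * (len(tuples) - 1) // 2
    return {levels: c / total_pairs for levels, c in counts.items()}


rows = [{"City": "Chicago"}, {"City": "Chicago"}, {"City": "Boston"}]
dist = statistical_distribution(rows, ["City"])
print(dist)  # one pair at level 9, two pairs at level 0
```

Each key of the returned dictionary plays the role of a statistical tuple $s$ without its $P$ column, and the associated value plays the role of $s[P]$.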



Then, we can measure the support and confidence of MDs, with various
attributes $X$ and $Y$, based on the statistical distribution $\mathcal{D}$. Let $\lambda_X$ and $\lambda_Y$ be
the projections of the matching similarity threshold pattern $\lambda$ on the attributes of
$X$ and $Y$ respectively in an MD $\varphi$, which are also specified in terms of elements
in $dom(A)$ of each $A \in X \cup Y$. Let $Z$ be the set of attributes not specified
by $\varphi$, i.e., $\mathcal{R} \setminus (X \cup Y)$. The definitions of support and confidence for the MD
$\varphi(X \to Y, \lambda)$ are as follows:

$$ support(\varphi) = P(X \vDash \lambda_X, Y \vDash \lambda_Y) = \sum_{Z} P(X \vDash \lambda_X, Y \vDash \lambda_Y, Z) $$

$$ confidence(\varphi) = P(Y \vDash \lambda_Y \mid X \vDash \lambda_X) = \frac{\sum_{Z} P(X \vDash \lambda_X, Y \vDash \lambda_Y, Z)}{\sum_{Y,Z} P(X \vDash \lambda_X, Y, Z)} $$

where $\vDash$ denotes the satisfiability relationship, i.e., $X \vDash \lambda_X$ denotes that the
similarity values on all attributes in $X$ satisfy the corresponding thresholds
listed in $\lambda_X$. For example, we say that a statistical tuple $s$ in $\mathcal{D}$ satisfies $\lambda_X$,
i.e., $s[X] \vDash \lambda_X$, if $s$ has similarity values no lower than the corresponding minimum
thresholds, i.e., $s[A] \ge \lambda[A]$, for each attribute $A$ in $X$.

Consider any two tuples $t_1$ and $t_2$ from the original data relation $\mathcal{R}$. The
$support(\varphi)$ estimates the probability that the matching similarities of $t_1$ and $t_2$
on attributes $X$ and $Y$ satisfy the thresholds specified by $\lambda_X$ and $\lambda_Y$,
respectively. Similarly, the $confidence(\varphi)$ computes the conditional probability that
the matching similarities between $t_1$ and $t_2$ on $Y$ satisfy the thresholds
specified by $\lambda_Y$ (i.e., $Y \vDash \lambda_Y$) given the condition that $t_1$ and $t_2$ are similar on
attributes $X$ (i.e., $X \vDash \lambda_X$). Thus, high $confidence(\varphi)$ means few instances of
matching pairs that are similar on attributes $X$ (i.e., $X \vDash \lambda_X$) but not similar
on attributes $Y$ (i.e., $Y \nvDash \lambda_Y$), where $\nvDash$ denotes the unsatisfiability relationship.
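
The two measures can be sketched in a few lines of Python. The sketch assumes the distribution is stored as a dictionary from level vectors to probabilities, that attributes are addressed by position, and that satisfaction means "level at least the threshold"; the function name `support_confidence` and the toy distribution are hypothetical.

```python
def support_confidence(dist, x_idx, y_idx, lam_x, lam_y):
    """Return (support, confidence) of an MD over a distribution
    given as {level vector: probability}."""
    p_x = 0.0   # P(X satisfies lam_x)
    p_xy = 0.0  # P(X satisfies lam_x and Y satisfies lam_y)
    for levels, p in dist.items():
        if all(levels[i] >= t for i, t in zip(x_idx, lam_x)):
            p_x += p
            if all(levels[i] >= t for i, t in zip(y_idx, lam_y)):
                p_xy += p
    # Confidence is undefined when no pair satisfies lam_x; 0.0 is returned by convention here.
    return p_xy, (p_xy / p_x if p_x > 0 else 0.0)


# Toy distribution over (X level, Y level); the probabilities sum to 1.
toy = {(9, 9): 0.5, (9, 0): 0.25, (0, 9): 0.25}
support, confidence = support_confidence(toy, [0], [1], (8,), (7,))
print(support, confidence)
```

In this toy case three quarters of the pairs satisfy the threshold on $X$, and two thirds of those also satisfy the threshold on $Y$.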

In real applications like inconsistency detection, in order to achieve high
detection accuracy, we would like to use MDs with high confidence. On the
other hand, if users need high recall of detection, then MDs with high support
are preferred. Intuitively, we would like to discover those MDs with high support
and high confidence. Therefore, in the remainder of this paper, we study the
problem of discovering MDs that can satisfy users' minimum utility requirements
of support $\eta_s$ and confidence $\eta_c$.

4 Exact Algorithm 

We now study the determination of the matching similarity threshold pattern for
MDs based on the statistical distribution, which is a new problem different from
FDs. In fact, once $X \to Y$ is given for an FD, it already implies the similarity
threshold to be 1.0, that is, $(X \to Y, \langle 1.0, 1.0 \rangle)$ if it is represented in the MD
syntax. Unlike FDs, MDs admit various settings of matching similarity thresholds.
Therefore, in this section, we discuss how to find the right similarity
thresholds in order to discover the MDs satisfying the required support and
confidence.

4.1 Problem Statement 

In order to discover an MD $\varphi$ with the minimum requirements of support $\eta_s$
and confidence $\eta_c$, the following preliminaries should be given first: (I) what is
$Y$? and (II) what is the matching quality requirement $\lambda_Y$? These two preliminary
questions are usually addressed by specific applications. For example, if we
would like to use discovered MDs to guide object identification in the Contacts
table, then $Y$ = SIN. The $\lambda_Y$ is often set to high similarity thresholds by
applications to ensure high matching quality on the $Y$ attributes. For example,
$\lambda_Y$ is set to 1.0 for $Y$ = SIN in the object identification application. Note
that without the preliminary $\lambda_Y$, the discovered MDs will be meaningless. For
example, an MD with $\lambda_Y = 0$ can always satisfy any requirement of $\eta_c$, $\eta_s$. Since
all the statistical tuples can satisfy the thresholds $\lambda_Y = 0$, the corresponding
support and confidence will always be equal to 1.0.

Definition 2. The threshold determination problem of MDs is: given the
minimum requirements of support and confidence $\eta_s$, $\eta_c$ and the matching similarity
threshold pattern $\lambda_Y$, find all the MDs $\varphi(X \to Y, \lambda)$ with threshold pattern $\lambda_X$ on
attributes $X$ having $confidence(\varphi) \ge \eta_c$ and $support(\varphi) \ge \eta_s$, if any exist; otherwise
return infeasible.

The attributes $X$ can be initially assigned as $\mathcal{R} \setminus Y$ if no suggestion is
provided by specific applications, since our discovery process can automatically
remove those attributes that are not required in $X$ for an MD $\varphi$. Specifically,
when a discovered threshold $\lambda[A]$ on attribute $A$ is $0 \in dom(A)$, it
means that any matching similarity value of the attribute $A \in X$ can satisfy the
threshold and will not affect the MD $\varphi$ at all. In other words, the attribute $A$
can be removed from $X$ of the MD $\varphi$.



4.2 Exact Algorithm 



Now, we present an algorithm to compute the matching similarity thresholds
on attributes $X$ for MDs having support and confidence no less than $\eta_s$ and $\eta_c$,
respectively. Let $A_1, \dots, A_{m_X}$ be the $m_X$ attributes in $X$. For simplicity, we
use $\lambda$ to denote the threshold pattern projection $\lambda_X$ with $\lambda[A_1], \dots, \lambda[A_{m_X}]$ on
all the $m_X$ attributes of $X$. Since each threshold $\lambda[A]$ on attribute $A$ is a value
from $dom(A)$, i.e., $\lambda[A] \in dom(A)$, we can investigate all the possible candidates
of threshold pattern $\lambda$. Let $\mathcal{C}_t$ be the set of all the possible threshold pattern
candidates, having

$$ \mathcal{C}_t = dom(A_1) \times \dots \times dom(A_{m_X}) = dom(X). $$

The total number of candidates is $c = |\mathcal{C}_t| = |dom(X)| = d^{m_X}$, where $d$ is the
size of $dom(A)$.

Let $n$ be the number of statistical tuples in the input statistical distribution
$\mathcal{D}$. We consider two statistical values $P_i^j(X,Y)$ and $P_i^j(X)$, which record
$P(X \vDash \lambda_X, Y \vDash \lambda_Y)$ and $P(X \vDash \lambda_X)$ respectively for the candidate $\lambda_j \in \mathcal{C}_t$ based on the
information of the first $i$ tuples in $\mathcal{D}$, initially having $P_0^j(X,Y) = P_0^j(X) = 0$.
The recursion is defined as follows, with $i$ increasing from 1 to $n$ and $j$ increasing
from 1 to $c$.

$$ P_i^j(X,Y) = \begin{cases} P_{i-1}^j(X,Y) + s_i[P], & \text{if } s_i[X] \vDash \lambda_j,\ s_i[Y] \vDash \lambda_Y \\ P_{i-1}^j(X,Y), & \text{otherwise} \end{cases} $$

$$ P_i^j(X) = \begin{cases} P_{i-1}^j(X) + s_i[P], & \text{if } s_i[X] \vDash \lambda_j \\ P_{i-1}^j(X), & \text{otherwise} \end{cases} $$

Finally, those $\lambda_j$ can be returned if support $= P_n^j(X,Y) \ge \eta_s$ and confidence
$= P_n^j(X,Y) / P_n^j(X) \ge \eta_c$.

Algorithm 1 Exact algorithm EA($\mathcal{D}$, $\mathcal{C}_t$)
for each candidate $\lambda_j \in \mathcal{C}_t$, $j : 1 \to c$ do
    $P_0^j(X,Y) = P_0^j(X) = 0$
    for each statistical tuple $s_i \in \mathcal{D}$, $i : 1 \to n$ do
        compute $P_i^j(X,Y)$, $P_i^j(X)$
return $\lambda_j$ with confidence and support satisfying $\eta_c$, $\eta_s$

We can implement the exact algorithm (namely EA) by considering all the
statistical tuples $s_i$ in $\mathcal{D}$ with $i$ from 1 to $n$, whose time complexity is $O(nc)$.

4.3 Pruning Strategies

The original exact algorithm needs to traverse all the $n$ statistical tuples
in $\mathcal{D}$ and all $c$ candidate threshold patterns in $\mathcal{C}_t$, which is very costly. In fact,
with the given $\eta_s$ and $\eta_c$, we can investigate the relationship between similarity
thresholds and avoid checking all candidate threshold patterns in $\mathcal{C}_t$ and all
statistical tuples in $\mathcal{D}$. Therefore, in the following two subsections, we present
pruning techniques based on the given support and confidence, respectively.
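
As a baseline for the pruning techniques, the exact algorithm EA of Section 4.2 might be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the distribution is a dictionary from level vectors to probabilities, every attribute shares the same domain $[0, d-1]$, and the name `exact_algorithm` is hypothetical.

```python
from itertools import product


def exact_algorithm(dist, x_idx, y_idx, lam_y, eta_s, eta_c, d=10):
    """Enumerate all d**len(x_idx) threshold patterns on X and keep those whose
    support and confidence reach eta_s and eta_c. Cost is O(n * c)."""
    results = []
    for lam in product(range(d), repeat=len(x_idx)):  # candidate set C_t
        p_xy = p_x = 0.0
        for levels, p in dist.items():                # all n statistical tuples
            if all(levels[i] >= t for i, t in zip(x_idx, lam)):
                p_x += p
                if all(levels[i] >= t for i, t in zip(y_idx, lam_y)):
                    p_xy += p
        if p_x > 0 and p_xy >= eta_s and p_xy / p_x >= eta_c:
            results.append((lam, p_xy, p_xy / p_x))
    return results


# Toy distribution over (X level, Y level).
toy = {(9, 9): 0.5, (5, 0): 0.25, (0, 9): 0.25}
found = exact_algorithm(toy, [0], [1], (9,), eta_s=0.4, eta_c=0.9)
print([lam for lam, _, _ in found])  # [(6,), (7,), (8,), (9,)]
```

With thresholds up to 5 on $X$, the pair class `(5, 0)` still counts toward $P(X)$ and pulls the confidence below 0.9; from threshold 6 upward only `(9, 9)` remains, so the confidence is 1.0.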

Pruning by support We first study the relationships among different threshold
patterns, based on which we then propose rules to filter out candidates that
have supports lower than $\eta_s$.

Definition 3. Given two similarity threshold patterns $\lambda_1$ and $\lambda_2$, if $\lambda_1[A] \le
\lambda_2[A]$ holds for all the attributes $A \in X$, then $\lambda_1$ dominates $\lambda_2$, denoted as
$\lambda_1 \preceq \lambda_2$.

Based on the definition of dominance, the following lemma describes the
relationship between the supports of similarity threshold patterns.

Lemma 1. Given two MDs, $\varphi_1 = (X \to Y, \lambda_1)$ and $\varphi_2 = (X \to Y, \lambda_2)$, over
the same relation instance of $\mathcal{R}$, if $\lambda_1$ dominates $\lambda_2$, $\lambda_1 \preceq \lambda_2$, then we have
$support(\varphi_1) \ge support(\varphi_2)$.

Proof. Let $cover(\lambda_1)$ and $cover(\lambda_2)$ denote the sets of statistical tuples that satisfy
the thresholds $\lambda_1$ and $\lambda_2$ respectively, e.g., $cover(\lambda_2) = \{s \mid s[X] \vDash \lambda_2, s \in \mathcal{D}\}$.
According to the minimum similarity thresholds, for each attribute $A$, we have
$\lambda_2[A] \le s[A]$. In addition, since $\lambda_1 \preceq \lambda_2$, for any tuple $s \in cover(\lambda_2)$, we also have
$\lambda_1[A] \le \lambda_2[A] \le s[A]$ on all the attributes $A$. In other words, the set of statistical
tuples covered by $\lambda_2$ also satisfies the thresholds of $\lambda_1$, i.e., $cover(\lambda_2) \subseteq cover(\lambda_1)$.
Referring to the definition of support, we have $support(\varphi_1) \ge support(\varphi_2)$. □

According to Lemma 1, given a candidate similarity threshold pattern $\lambda_j$
having support lower than the user-specified requirement $\eta_s$, i.e., $P_n^j(X,Y) < \eta_s$,
all the candidates that are dominated by $\lambda_j$ must have support lower than
$\eta_s$ and can be safely pruned without computing their associated support and
confidence.

We present the implementation of pruning by support (namely EPS) in
Algorithm 2.



Algorithm 2 Pruning by support EPS($\mathcal{D}$, $\mathcal{C}_t$)
for each candidate $\lambda_j \in \mathcal{C}_t$, $j : 1 \to c$ do
    $P_0^j(X,Y) = P_0^j(X) = 0$
    for each tuple $s_i \in \mathcal{D}$, $i : 1 \to n$ do
        compute $P_i^j(X,Y)$, $P_i^j(X)$
    if $P_n^j(X,Y) < \eta_s$ then
        remove all the remaining candidates $\lambda'$ dominated by $\lambda_j$ from $\mathcal{C}_t$
        {Pruning by support, $\lambda' \succeq \lambda_j$}
return $\lambda_j$ with confidence and support satisfying $\eta_c$, $\eta_s$



In order to maximize the pruning, we can heuristically select an ordering
of candidates in $\mathcal{C}_t$ such that $j_1 < j_2$ whenever $\lambda_{j_1}$ dominates a different
pattern $\lambda_{j_2}$. That is, we always
first process the candidates that dominate others. In fact, we can use a DAG
(directed acyclic graph), $\mathcal{G}$, to represent candidate similarity patterns as vertices
and dominance relationships among the similarity patterns as edges. Therefore,
the dominance order of candidate patterns can be obtained by a BFS traversal
of $\mathcal{G}$.
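
A minimal Python sketch of pruning by support is given below. Instead of building the DAG explicitly, it relies on the observation that lexicographic enumeration already visits every pattern before any pattern it dominates, and it keeps a list of patterns that failed the support test. The names `dominates` and `prune_by_support`, the dictionary representation of the distribution, and the shared domain size `d` are assumptions made for this example.

```python
from itertools import product


def dominates(lam1, lam2):
    """lam1 dominates lam2 if it is no larger on every attribute (Definition 3)."""
    return all(a <= b for a, b in zip(lam1, lam2))


def prune_by_support(dist, x_idx, y_idx, lam_y, eta_s, eta_c, d=10):
    """Return (qualifying patterns, number of patterns actually evaluated)."""
    results, failed, evaluated = [], [], 0
    # Lexicographic order visits a pattern before every pattern it dominates.
    for lam in product(range(d), repeat=len(x_idx)):
        if any(dominates(f, lam) for f in failed):
            continue  # by Lemma 1 its support cannot reach eta_s
        evaluated += 1
        p_xy = p_x = 0.0
        for levels, p in dist.items():
            if all(levels[i] >= t for i, t in zip(x_idx, lam)):
                p_x += p
                if all(levels[i] >= t for i, t in zip(y_idx, lam_y)):
                    p_xy += p
        if p_xy < eta_s:
            failed.append(lam)
        elif p_x > 0 and p_xy / p_x >= eta_c:
            results.append(lam)
    return results, evaluated


# Toy distribution over (X1 level, X2 level, Y level) with d = 3.
toy = {(2, 2, 2): 0.5, (2, 0, 2): 0.25, (0, 0, 0): 0.25}
patterns, evaluated = prune_by_support(toy, [0, 1], [2], (2,), 0.6, 0.5, d=3)
print(patterns, evaluated)  # [(0, 0), (1, 0), (2, 0)] 4
```

In this toy run only 4 of the 9 candidates are evaluated: once `(0, 1)` fails the support requirement, every pattern with a second threshold of at least 1 is skipped.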



Pruning by confidence Other than pruning by support, we can also utilize
the given confidence requirement to avoid further examining tuples that
cannot improve the confidence once the confidence is already lower than $\eta_c$ for
a candidate $\lambda_j$.

We first group the statistical tuples in $\mathcal{D}$ into two parts based on the
preliminary $\lambda_Y$ as follows. Let $k$ be a pivot between 1 and $n$. For the first $k$
tuples, we have $s_i[Y] \vDash \lambda_Y$, $1 \le i \le k$. All the remaining $n - k$ tuples have
$s_i[Y] \nvDash \lambda_Y$, $k + 1 \le i \le n$. This grouping of statistical tuples in $\mathcal{D}$ can be done
in linear time.

Lemma 2. Consider a pre-grouped statistical distribution $\mathcal{D}$. For any $1 \le i_1 <
i_2 \le n$, we always have

$$ \frac{P_{i_1}^j(X,Y)}{P_{i_1}^j(X)} \ge \frac{P_{i_2}^j(X,Y)}{P_{i_2}^j(X)}. $$

Proof. Since the first $k$ tuples have $s_i[Y] \vDash \lambda_Y$, according to the computation
of $P(X,Y)$ and $P(X)$, we have

$$ \frac{P_i^j(X,Y)}{P_i^j(X)} = 1.0, \quad 1 \le i \le k. $$

Moreover, for the remaining $n - k$ tuples with $s_i[Y] \nvDash \lambda_Y$, the $P(X,Y)$ value
will not change any more, i.e., $P_i^j(X,Y) = P_k^j(X,Y)$, $k + 1 \le i \le n$. Meanwhile,
the corresponding $P(X)$ is non-decreasing, that is, $P_k^j(X) \le P_{i_1}^j(X) \le P_{i_2}^j(X)$
for any $k + 1 \le i_1 < i_2 \le n$. Consequently, we have

$$ \frac{P_{i_1}^j(X,Y)}{P_{i_1}^j(X)} \ge \frac{P_{i_2}^j(X,Y)}{P_{i_2}^j(X)}, \quad k + 1 \le i_1 < i_2 \le n. $$

Combining the above two statements proves the lemma. □

Therefore, according to the formula of confidence, with the increase of $i$
from 1 to $n$, the confidence of a specific candidate $\lambda_j$ is non-increasing. For a
candidate $\lambda_j$, when processing the statistical tuple $s_i$, if the current confidence
$P_i^j(X,Y) / P_i^j(X)$ is lower than $\eta_c$, then we can prune the candidate $\lambda_j$ without considering
the remaining statistical tuples from $i + 1$ to $n$ in $\mathcal{D}$.

Finally, the pruning by support and the pruning by confidence are
combined into a single threshold determination algorithm, as shown
in Algorithm 3 (namely EPSC). We also demonstrate the performance of the
hybrid pruning EPSC in Section 6.

Algorithm 3 Pruning by support & confidence EPSC($\mathcal{D}$, $\mathcal{C}_t$)
for each candidate $\lambda_j \in \mathcal{C}_t$, $j : 1 \to c$ do
    $P_0^j(X,Y) = P_0^j(X) = 0$
    for each tuple $s_i \in \mathcal{D}$, $i : 1 \to n$ do
        compute $P_i^j(X,Y)$, $P_i^j(X)$
        if $P_i^j(X,Y) / P_i^j(X) < \eta_c$ then
            remove $\lambda_j$ from $\mathcal{C}_t$ {Pruning by confidence}
            if $P_i^j(X,Y) \ge \eta_s$ then
                break
    if $P_n^j(X,Y) < \eta_s$ then
        remove all the remaining candidates $\lambda'$ dominated by $\lambda_j$ from $\mathcal{C}_t$
        {Pruning by support, $\lambda' \succeq \lambda_j$}
return $\lambda_j$ with confidence and support satisfying $\eta_c$, $\eta_s$
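
The combined pruning can be sketched in Python as follows. This is a simplified illustration rather than a line-by-line transcription of Algorithm 3: it stops scanning as soon as the confidence falls below $\eta_c$, which relies on the pre-grouping (a confidence below 1 can only occur after all tuples satisfying $\lambda_Y$ have been processed, so $P(X,Y)$ is already final at that point). The names `dominates` and `epsc`, the dictionary representation, and the shared domain size `d` are assumptions.

```python
from itertools import product


def dominates(lam1, lam2):
    """lam1 dominates lam2 if it is no larger on every attribute (Definition 3)."""
    return all(a <= b for a, b in zip(lam1, lam2))


def epsc(dist, x_idx, y_idx, lam_y, eta_s, eta_c, d=10):
    """Threshold determination with pruning by support and by confidence."""
    def sat(levels, idx, lam):
        return all(levels[i] >= t for i, t in zip(idx, lam))

    # Pre-group: tuples satisfying lam_y come first (False sorts before True).
    ordered = sorted(dist.items(), key=lambda kv: not sat(kv[0], y_idx, lam_y))
    results, failed = [], []
    for lam in product(range(d), repeat=len(x_idx)):
        if any(dominates(f, lam) for f in failed):
            continue  # pruned by support (Lemma 1)
        p_xy = p_x = 0.0
        low_confidence = False
        for levels, p in ordered:
            if sat(levels, x_idx, lam):
                p_x += p
                if sat(levels, y_idx, lam_y):
                    p_xy += p
            if p_x > 0 and p_xy / p_x < eta_c:
                low_confidence = True  # confidence cannot recover (Lemma 2)
                break
        if p_xy < eta_s:
            failed.append(lam)  # patterns dominated by lam will be skipped
        elif not low_confidence and p_x > 0:
            results.append(lam)
    return results


# Toy distribution over (X1 level, X2 level, Y level) with d = 3.
toy = {(2, 2, 2): 0.5, (2, 0, 2): 0.25, (0, 0, 0): 0.25}
print(epsc(toy, [0, 1], [2], (2,), eta_s=0.6, eta_c=0.9, d=3))  # [(1, 0), (2, 0)]
```

In the toy run, pattern `(0, 0)` is dropped by confidence (0.75 < 0.9), `(0, 1)` fails the support requirement and thereby removes every pattern it dominates, and `(1, 0)` and `(2, 0)` qualify.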



5 Approximation Algorithm 

Though we have proposed pruning rules for the exact method (Algorithm 3), the
whole evaluation space is still all the $n$ tuples in the statistical distribution $\mathcal{D}$.
Therefore, in this section, we present an approximate algorithm which only
traverses the first $k$ ($k = 1, \dots, n$) tuples in $\mathcal{D}$, with bounded relative errors on
the support and confidence of the returned MDs.

Let $C^n$ and $S^n$ be the confidence and support computed in the exact solution
with all $n$ tuples. We study the approximate confidence and support, $C^k$ and
$S^k$, obtained by ignoring the statistical tuples from $s_{k+1}$ to $s_n$. For a candidate threshold
pattern $\lambda_j \in \mathcal{C}_t$, let

$$ \beta = P_k^j(X), \qquad \bar\beta = P_n^j(X) - P_k^j(X), $$

where $\beta$ denotes $P(X \vDash \lambda_X)$ for the candidate $\lambda_j$ based on the first $k$ tuples in
$\mathcal{D}$, and $\bar\beta$ is $P(X \vDash \lambda_X)$ based on the remaining $n - k$ tuples. The following
lemma indicates the error bounds of $C^k$ and $S^k$ when $\bar\beta$ for a specific $k$ is in a
certain range.

Lemma 3. If we have $\bar\beta \le \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right)$, then the error of the approximate
confidence $C^k$ compared to the exact confidence $C^n$ is bounded by $-\epsilon \le \frac{C^n - C^k}{C^n} \le
\epsilon$, and the error of the approximate support $S^k$ compared to the exact $S^n$ is bounded
by $\frac{S^n - S^k}{S^n} \le \epsilon$.

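Proof. Let

$$\alpha = P_k^j(X, Y), \qquad \bar{\alpha} = P_n^j(X, Y) - P_k^j(X, Y).$$

According to the computation of confidence, we have $C^k = \frac{\alpha}{\beta}$ and $C^n = \frac{\alpha + \bar{\alpha}}{\beta + \bar{\beta}}$.
Let $Z = 1 - \frac{C^n - C^k}{C^n} = \frac{C^k}{C^n}$, that is,

$$Z = \frac{\alpha(\beta + \bar{\beta})}{\beta(\alpha + \bar{\alpha})} \leq 1 + \frac{\bar{\beta}}{\beta}.$$

First, we have $\beta = \alpha + \sum_{i=1}^{k} s_i[P(X \models \lambda_X, Y \not\models \lambda_Y)] \geq \alpha$. Note that $\alpha$ is the
approximate support of the MD $\varphi$ with matching similarity threshold pattern $\lambda_j$
on the attributes $X$. According to the minimum support constraint, for a valid
$\lambda_j$, we have $\beta \geq \alpha \geq \eta_s$. Thereby,

$$Z \leq 1 + \frac{\bar{\beta}}{\eta_s}.$$

Moreover, according to the condition $\bar{\beta} \leq \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right)$, that is, $\bar{\beta} \leq \epsilon\eta_s$, we
have

$$Z \leq 1 + \epsilon.$$

Second, similar to $\beta \geq \alpha$, we also have $\bar{\alpha} \leq \bar{\beta}$ for the tuples from $k + 1$ to $n$.
Therefore,

$$Z \geq \frac{\alpha(\beta + \bar{\beta})}{\beta(\alpha + \bar{\beta})} = \frac{\beta + \bar{\beta}}{\beta + \frac{\beta}{\alpha}\bar{\beta}}.$$

According to the minimum confidence $\frac{\alpha}{\beta} \geq \eta_c$,

$$Z \geq \frac{\beta + \bar{\beta}}{\beta + \frac{\bar{\beta}}{\eta_c}} = 1 - \frac{\bar{\beta}(1 - \eta_c)}{\beta\eta_c + \bar{\beta}}. \qquad (1)$$

Recall that $\beta \geq \eta_s$ and the confidence should be lower than or equal to 1, i.e.,
$\eta_c \leq 1$. Thus,

$$Z \geq 1 - \frac{\bar{\beta}(1 - \eta_c)}{\eta_s\eta_c + \bar{\beta}} = 1 - \frac{1 - \eta_c}{\frac{\eta_s\eta_c}{\bar{\beta}} + 1}.$$

Since we have the condition $\bar{\beta} \leq \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}$,

$$Z \geq 1 - \frac{1 - \eta_c}{\frac{1-\epsilon-\eta_c}{\epsilon} + 1} = 1 - \epsilon.$$

Finally, based on the above two conditions, we conclude that

$$1 + \epsilon \geq Z = 1 - \frac{C^n - C^k}{C^n} = \frac{C^k}{C^n} \geq 1 - \epsilon,$$

$$-\epsilon \leq \frac{C^n - C^k}{C^n} \leq \epsilon.$$

On the other hand, according to the computation of support, we have $S^k = \alpha$
and $S^n = \alpha + \bar{\alpha}$. Therefore,

$$\frac{S^n - S^k}{S^n} = \frac{1}{1 + \frac{\alpha}{\bar{\alpha}}}.$$

Recall that we have $\alpha \geq \eta_s$ and $\bar{\alpha} \leq \bar{\beta} \leq \epsilon\eta_s$. Hence,

$$\frac{S^n - S^k}{S^n} \leq \frac{1}{1 + \frac{1}{\epsilon}} = \frac{\epsilon}{1 + \epsilon} < \epsilon.$$

That is, the worst-case relative error is bounded by $\epsilon$ for both the confidence
and support. □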
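The bound of Lemma 3 can be checked numerically on a small instance. The sketch below is illustrative only: the function names are hypothetical, and the values of $\alpha$, $\beta$ and $\bar{\alpha}$ in the usage note are made-up numbers chosen to satisfy the lemma's preconditions ($\alpha \geq \eta_s$, $\alpha/\beta \geq \eta_c$, $\bar{\alpha} \leq \bar{\beta}$).

```python
def beta_bar_limit(eps: float, eta_s: float, eta_c: float) -> float:
    """Largest ignored mass on X allowed by the condition of Lemma 3."""
    if not 0.0 < eps < 1.0 - eta_c:
        raise ValueError("eps must lie in (0, 1 - eta_c)")
    return min(eps * eta_s, eps * eta_s * eta_c / (1.0 - eps - eta_c))


def relative_errors(alpha: float, beta: float,
                    alpha_bar: float, beta_bar: float):
    """Return ((C^n - C^k) / C^n, (S^n - S^k) / S^n)."""
    c_k = alpha / beta                              # approximate confidence
    c_n = (alpha + alpha_bar) / (beta + beta_bar)   # exact confidence
    s_n = alpha + alpha_bar                         # exact support
    return (c_n - c_k) / c_n, (s_n - alpha) / s_n
```

For example, with $\epsilon = 0.2$, $\eta_s = 0.1$ and $\eta_c = 0.5$ the limit on $\bar{\beta}$ is about 0.02. Taking $\alpha = 0.12$, $\beta = 0.2$ and $\bar{\beta}$ at that limit, every $\bar{\alpha}$ between 0 and $\bar{\beta}$ yields relative errors within $\epsilon$.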

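Now, we consider the last $n - k$ tuples in $\mathcal{D}$. Let

$$B(k) = \sum_{i=k+1}^{n} s_i[P],$$

where $s_i[P]$ is the probability associated with each statistical tuple in $\mathcal{D}$. Referring
to the definition of $\bar{\beta}$, for any $\lambda_j$, we always have $\bar{\beta} \leq B(k)$. If there exists a $k$
having $B(k) \leq \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right)$, then $\bar{\beta} \leq \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right)$ is satisfied for all
the threshold candidates $\lambda_j$. Since $B(k)$ decreases as $k$ increases,
determining a minimum $k$ amounts to finding the corresponding maximum $B(k)$. Therefore,
according to Lemma 3, given an error bound $\epsilon$, $0 < \epsilon < 1 - \eta_c$, we can compute
a minimum position $k = \arg\max_{k=1}^{n} B(k)$ having $B(k) \leq \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right)$.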

Theorem 1. Given an error bound $\epsilon$, $0 < \epsilon < 1 - \eta_c$, we can determine a
minimum $k$, having

$$B(k) \leq \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right), \quad 1 \leq k \leq n.$$

The approximation by considering the first $k$ tuples in $\mathcal{D}$ finds approximate MDs
with the error bound $\epsilon$ on both the confidence and support compared with the
exact one. The complexity is $O(kc)$.
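The minimum $k$ of Theorem 1 can be found with one backward pass over the tuple probabilities. The following is a minimal sketch, assuming the probabilities $s_i[P]$ are supplied as a Python list in tuple order; the function name `minimum_k` is hypothetical.

```python
from typing import Sequence


def minimum_k(probs: Sequence[float], eps: float,
              eta_s: float, eta_c: float) -> int:
    """Smallest k (1-based) whose ignored tail B(k) stays within the bound."""
    if not 0.0 < eps < 1.0 - eta_c:
        raise ValueError("eps must lie in (0, 1 - eta_c)")
    bound = min(eps * eta_s, eps * eta_s * eta_c / (1.0 - eps - eta_c))
    k = len(probs)
    tail = 0.0  # B(n) = 0: nothing is ignored when all n tuples are used
    # B(k - 1) = B(k) + s_k[P]; keep lowering k while B(k - 1) fits the bound.
    while k > 1 and tail + probs[k - 1] <= bound:
        tail += probs[k - 1]
        k -= 1
    return k
```

With probabilities `[0.5, 0.25, 0.125, 0.0625, 0.0625]`, $\epsilon = 0.5$ and $\eta_s = \eta_c = 0.25$, the bound is 0.125 and the function returns 3, since $B(3) = 0.125$ fits the bound while $B(2) = 0.25$ does not.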

Finally, we present the approximation implementation in Algorithm 4. Let
$B$ denote $B(k) = \sum_{j=k+1}^{n} s_j[P]$ for the current $k$. With $k$ decreasing from
$n$ to 1, we can determine a minimum $k$ where $B = B(k) \leq \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right)$
is still satisfied. After computing $k$, we process the tuples $s_i$ starting from
$i = 1$. When the bound condition is first satisfied, i.e., $i = k$ with $B = B(k) \leq
\min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right)$, the processing terminates. Here, the error bound $\epsilon$ is specified
by user requirement with $0 < \epsilon < 1 - \eta_c$.

Given an error bound $\epsilon$, the bound condition is then fixed. In order to
minimize $k$, we expect that the $P$ values of the tuples from $k + 1$ to $n$ in
$B(k) = \sum_{j=k+1}^{n} s_j[P]$ are small. In other words, an instance of $\mathcal{D}$ with higher
$P$ in the tuples from 1 to $k$ is preferred. Therefore, we can reorganize the tuples
in $\mathcal{D}$ in decreasing order of $P$ as the input of Algorithm 4. The ordering of
statistical tuples in $\mathcal{D}$ by the $P$ values can be done in linear time by amortizing
the $P$ values into a constant domain.
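One way to read "amortizing the $P$ values into a constant domain" is to discretize the probabilities into a fixed number of buckets and collect the buckets from highest to lowest. The sketch below follows that reading, which is an assumption rather than a detail given in the text; the function name and the bucket count are hypothetical, and the resulting order is only exact up to the bucket resolution.

```python
from typing import List, Sequence


def order_by_probability(probs: Sequence[float], buckets: int = 1000) -> List[int]:
    """Return tuple indices in (bucketed) decreasing order of probability."""
    top = max(probs, default=0.0)
    if top <= 0.0:
        return list(range(len(probs)))
    slots: List[List[int]] = [[] for _ in range(buckets + 1)]
    for index, p in enumerate(probs):
        # Map each probability to one of a constant number of slots.
        slots[int(p / top * buckets)].append(index)
    # Reading the slots from high to low costs O(n + buckets) overall.
    return [index for slot in reversed(slots) for index in slot]
```

Because the number of slots is a constant, the whole pass is linear in the number of statistical tuples, at the price of leaving tuples within the same slot in their original relative order.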
Algorithm 4 Approximation algorithm AP($\mathcal{D}$, $C_t$)

1: for each tuple $s_k \in \mathcal{D}$, $k : n \to 1$ do
2:   $B$ += $s_k[P]$
3:   if $B > \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right)$ then
4:     $k$++; break {Compute $k$}
5: for each candidate $\lambda_j \in C_t$, $j : 1 \to c$ do
6:   $P^j(X, Y) = P^j(X) = 0$
7:   for each tuple $s_i \in \mathcal{D}$, $i : 1 \to k$ do
8:     compute $P_i^j(X, Y)$, $P_i^j(X)$
9: return $\lambda_j$ with confidence and support satisfying $\eta_c$, $\eta_s$
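The evaluation loop of Algorithm 4 (lines 5 to 9) can be sketched in Python as follows, assuming the cutoff $k$ has already been determined from the bound condition. The representation of a statistical tuple as a pair of per-attribute similarities and a probability, and the name `approximate_discovery`, are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Pattern = Dict[str, float]
StatTuple = Tuple[Dict[str, float], float]  # (similarity per attribute, probability)


def approximate_discovery(tuples: Sequence[StatTuple], candidates: List[Pattern],
                          x_attrs: Sequence[str], y_attrs: Sequence[str],
                          k: int, eta_s: float, eta_c: float):
    """Evaluate every candidate on the first k tuples only."""
    results = []
    for pattern in candidates:
        p_x = p_xy = 0.0
        for sims, prob in tuples[:k]:  # tuples k+1..n are never read
            if all(sims[a] >= pattern[a] for a in x_attrs):
                p_x += prob
                if all(sims[a] >= pattern[a] for a in y_attrs):
                    p_xy += prob
        confidence = p_xy / p_x if p_x > 0 else 0.0
        if p_xy >= eta_s and confidence >= eta_c:
            results.append((pattern, p_xy, confidence))
    return results
```

Calling the function once with the computed $k$ and once with $k = n$ gives the approximate and exact measures side by side, which makes the relative errors easy to inspect on small inputs.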



Approximation Individually We study the approximation by each individual
candidate $\lambda_j$ with a more efficient bound condition respectively. According
to formula (1) in the proof of the error bound, we find that for each specific candidate
$\lambda_j$, if $\bar{\beta} \leq \min\left(\epsilon\beta, \frac{\epsilon\beta\eta_c}{1-\epsilon-\eta_c}\right)$, then the error bound is already satisfied and
the processing can be terminated for this $\lambda_j$. Therefore, rather than one fixed
bound condition for all the candidates, the bound of $\bar{\beta}$ can be determined dynamically
for each candidate $\lambda_j$ respectively during the processing. Algorithm 5
shows the implementation of approximation with a dynamic bound condition on
each candidate individually.



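Algorithm 5 Approximation individually API($\mathcal{D}$, $C_t$)

for each tuple $s_i \in \mathcal{D}$, $i : n \to 1$ do
  $B$ += $s_i[P]$
  if $B \leq \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right)$ then
    $k = i$ {Compute $k$}
for each candidate $\lambda_j \in C_t$, $j : 1 \to c$ do
  $P^j(X, Y) = P^j(X) = 0$
  $B_j = B$
  for each tuple $s_i \in \mathcal{D}$, $i : 1 \to k$ do
    compute $P_i^j(X, Y)$, $P_i^j(X)$
    $\beta = P_i^j(X)$
    $B_j$ -= $s_i[P]$
    if $B_j \leq \min\left(\epsilon\beta, \frac{\epsilon\beta\eta_c}{1-\epsilon-\eta_c}\right)$ then
      break
return $\lambda_j$ with confidence and support satisfying $\eta_c$, $\eta_s$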
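The per-candidate early termination of Algorithm 5 can be sketched for a single candidate as below. This is an illustrative reading, not the authors' code: the tuple representation and the name `evaluate_individually` are assumptions, and the global cutoff $k$ is passed in as a parameter.

```python
from typing import Dict, Sequence, Tuple

StatTuple = Tuple[Dict[str, float], float]  # (similarity per attribute, probability)


def evaluate_individually(tuples: Sequence[StatTuple], pattern: Dict[str, float],
                          x_attrs: Sequence[str], y_attrs: Sequence[str],
                          k: int, eps: float, eta_c: float):
    """Return (support, confidence, tuples processed) for one candidate."""
    remaining = sum(prob for _, prob in tuples)  # B_j starts at the total mass
    p_x = p_xy = 0.0
    processed = 0
    for sims, prob in tuples[:k]:
        if all(sims[a] >= pattern[a] for a in x_attrs):
            p_x += prob
            if all(sims[a] >= pattern[a] for a in y_attrs):
                p_xy += prob
        remaining -= prob   # mass of the tuples not yet seen
        processed += 1
        beta = p_x          # dynamic bound uses this candidate's own beta
        if remaining <= min(eps * beta, eps * beta * eta_c / (1.0 - eps - eta_c)):
            break
    confidence = p_xy / p_x if p_x > 0 else 0.0
    return p_xy, confidence, processed
```

On four tuples with probabilities 0.5, 0.25, 0.125 and 0.125 that all satisfy $X$, with $\epsilon = 0.5$ and $\eta_c = 0.25$, the loop stops after two tuples because the unseen mass 0.25 is already below $\epsilon\beta = 0.375$.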



Corollary 1. The worst-case complexity of the approximation individually is
$O(kc)$.

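Proof. Note that as $i$ increases from 1 to $k$, for a specific $\lambda_j$, the
value $\beta$ increases and $B_j$ decreases. For any $i < k$, if $\beta < \eta_s$, i.e., $\lambda_j$ is currently
invalid, the bound condition cannot be satisfied, having

$$\min\left(\epsilon\beta, \frac{\epsilon\beta\eta_c}{1-\epsilon-\eta_c}\right) < \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right) < B_j.$$

When $\lambda_j$ has $\beta \geq \eta_s$ as a valid threshold, the bound condition is relaxed from
$\min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right)$ to $\min\left(\epsilon\beta, \frac{\epsilon\beta\eta_c}{1-\epsilon-\eta_c}\right)$. Thereby, the bound condition may be
satisfied by a smaller $i$ than $k$, i.e.,

$$\min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right) < B_j \leq \min\left(\epsilon\beta, \frac{\epsilon\beta\eta_c}{1-\epsilon-\eta_c}\right).$$

The worst case is that no candidate achieves its bound until processing
the tuple $s_k$, where

$$B_j = B(k) \leq \min\left(\epsilon\eta_s, \frac{\epsilon\eta_s\eta_c}{1-\epsilon-\eta_c}\right) \leq \min\left(\epsilon\beta, \frac{\epsilon\beta\eta_c}{1-\epsilon-\eta_c}\right)$$

must be satisfied. This is exactly Algorithm 4 without individual approximation. □

Figure 1: Pruning on CiteSeer. Figure 2: Pruning on Cora. Figure 3: Pruning on Restaurant.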

Finally, we combine the pruning by support with the approximation
(namely APS) and with the approximation individually (namely APSI),
respectively. As presented in the experimental evaluation, the approximation
techniques can further improve the discovery efficiency with an approximate
solution very close to the exact one (bounded by $\epsilon$).



6 Experimental Evaluation 

Now, we report the experimental evaluation of the proposed methods. All the
algorithms are implemented in Java. The experiments are run on a machine with an
Intel Core 2 CPU (2.13 GHz) and 2 GB of memory.
Figure 4: Advanced on CiteSeer. Figure 5: Advanced on Cora. Figure 6: Advanced on Restaurant.

Experiment Setting In the experimental evaluation, we use three real data
sets. The Cora$^3$ data set, prepared by McCallum et al. [24], consists of 12
attributes including author, volume, title, institution, venue, etc. The Restaurant$^4$
data set consists of restaurant records including the attributes name, address, city
and type. The CiteSeer$^5$ data set is selected with attributes including title,
author, address, affiliation, subject, description, etc. We use the cosine similarity
to evaluate the matching quality of the tuples
in the original data. By applying the dom(A) mapping in Section 3, we can
obtain statistical distributions with at most 186,031 statistical tuples in Cora,
140,781 statistical tuples in Restaurant and 314,382 statistical tuples in CiteSeer.
Our experimental evaluation is then conducted on several pre-processed
statistical distributions with various numbers of statistical tuples $n$, from 10,000 to
150,000 respectively.
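As a reminder of how a cosine similarity between two attribute values can be computed, the sketch below compares token-frequency vectors. The whitespace tokenization and lower-casing are assumptions made for illustration; the text does not state which tokenization or weighting was used in the experiments.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the token-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty value is treated as matching nothing
    return dot / (norm_a * norm_b)
```

Two values sharing one of their two tokens, such as "new york" and "new jersey", obtain a similarity of about 0.5, while identical values obtain about 1 and values with no common token obtain 0.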

We mainly observe the efficiency of the proposed algorithms. Since our main
task is to discover MDs under the required $\eta_s$ and $\eta_c$, we study the runtime
performance in various distributions with different $\eta_s$ and $\eta_c$ settings. The
discovery algorithms determine the matching similarity settings of attributes
for MDs. Suppose that users want to discover MDs on the following $X \to Y$ of
three data sets respectively: i) the dependencies on

Cora : author, volume, title $\to$ venue

with the preliminary requirement of minimum similarity 0.6 on venue; ii) the
dependencies on

Restaurant : name, address, type $\to$ city

with the preliminary requirement of minimum similarity 0.5 on city; and iii) the
dependencies on

CiteSeer : address, affiliation, description $\to$ subject

with the preliminary requirement of minimum similarity 0.1 on subject.

$^3$ http://www.cs.umass.edu/~mccallum/code-data.html
$^4$ http://www.cs.utexas.edu/users/ml/riddle/data.html
$^5$ http://citeseer.ist.psu.edu/

A returned result is either infeasible, or an MD with a threshold pattern on the
given $X \to Y$. For example, one of the results returned by a real experiment on
Cora is:

$\varphi$(author, volume, title $\to$ venue, $\langle 0.6, 0.0, 0.8, 0.6 \rangle$)

with support($\varphi$) = 0.020 and confidence($\varphi$) = 0.562, both greater than the
specified requirements of $\eta_s$ and $\eta_c$ respectively.

Exact Approach Evaluation First, we evaluate the performance of pruning
by support (EPS) compared with the original exact algorithm (EA). As shown in
(a) and (b) of Figures 1, 2 and 3, the EA, which verifies all the possible candidates,
has the same cost no matter how $\eta_s$ and $\eta_c$ are set. Therefore, the time
cost of EA in (a) is exactly the same as that in (b) on all three data sets.

Moreover, the EPS achieves significantly lower time cost in all the statistical
distributions, only about 1/10 of that of the EA. These results
demonstrate that our EPS approach can prune most of the candidates without costly
computation. Note that the time costs of the approaches increase linearly with data
sizes, which shows the scalability of discovering MDs on large data.

To observe more accurately, we also plot the EPS time cost in Figures 4, 5
and 6 with the same settings respectively. According to the pruning strategy,
the EPS performance is only affected by the support requirement $\eta_s$. In other words,
different $\eta_c$ settings have no effect on EPS. Thus, EPS has similar time costs in
Figure 5 (a) and (b) with the same $\eta_s$ but different $\eta_c$. Similar results can be
observed in Figure 6 as well.

On the other hand, the EPS approach conducts the pruning based on the
given requirement of support $\eta_s$. It is natural that a higher $\eta_s$ leads to
better pruning performance. Therefore, EPS with $\eta_s = 0.04$ in Figure 4 (a)
shows lower time cost, e.g., about 0.4s for 150k, than that of $\eta_s = 0.01$ in (b),
e.g., 0.6s for the same 150k. Similar results with different $\eta_s$ are also observed
on Cora and Restaurant, which are not presented due to space limits.

Advanced Approach Evaluation Now, we report the performance of the advanced
pruning and approximation techniques in Figures 4, 5 and 6, including
the pruning by both support and confidence (EPSC), the approximation together
with pruning by support (APS), and the approximation individually together
with pruning by support (APSI).

First, we study the influence of $\eta_c$ in different approaches. When the confidence
requirement $\eta_c$ is high, e.g., in Figure 5 (b) and 6 (b), the EPSC can remove
the low-confidence candidates and shows better time performance than the other
approaches. On the other hand, when $\eta_c$ is small, e.g., $\eta_c = 0.15$, we have a
larger choice of $\epsilon \in (0, 1 - \eta_c)$, such as $\epsilon = 0.8$ in Figure 5 (a) and 6 (a). Thus,
the approximation approaches have lower time cost, especially the APSI. According
to this analysis, we can choose EPSC in practical cases if the requirement
$\eta_c$ is high; otherwise, the APSI is preferred in order to achieve lower time costs.

According to the bound condition of the approximation approaches in Theorem 1,
not only $\epsilon$ but also $\eta_s$ affects the performance. As presented in
Figure 4 (a), a higher $\eta_s$ yields a larger bound condition, which means
earlier termination of the program. Thus, the approximation approaches show better
performance in Figure 4 (a), having $\eta_s = 0.04$, compared with Figure 4 (b),
where $\eta_s = 0.01$.

Note that the bound condition also depends on the distribution features. A
preferred distribution with more tuples in $\beta$ can achieve the bound condition
and terminate early, such as 50k in Figure 5 (a) with low time cost.

Finally, we evaluate the approximate confidence and support of the returned
MDs with $\epsilon = 0.8$ on two data sets in Figures 7 and 8. As we proved in
Lemma 3, the error introduced in the approximation approaches is bounded by $\epsilon$
on both confidence and support. Accordingly, in Figures 7 and 8, the approximate
confidence and support of APS and APSI are very close to those of the exact
algorithms.

Consequently, the approximate algorithm can achieve low time cost (e.g., in
Figure 4 (a), 5 (a) and 6 (a) with the same setting of $\epsilon$) without introducing
large variation in the confidence and support measures compared with the exact
ones.

Summary The experimental results demonstrate that our pruning and approximation
techniques can significantly improve the efficiency of discovering MDs.
i) The time costs of the approaches increase linearly with data sizes, which shows
the scalability of discovering MDs on large data. ii) The EPS approach can
significantly reduce the time costs by pruning candidates, compared with the EA.
iii) If the minimum confidence requirement $\eta_c$ is high, the pruning by confidence
works well. iv) Otherwise, we can employ the approximation approaches
to achieve low time cost.

7 Conclusions 

In this paper, we study the discovery of matching dependencies. First, we 
formally define the utility evaluation of matching dependencies by using support 
and confidence. Then, we introduce the problem of discovering the MDs with 
minimum confidence and support requirements. Both pruning strategies and 
approximation of the exact algorithm are studied. The pruning by support 
can filter out the candidate patterns with low supports. In addition, if the 
minimum confidence requirement is high, the pruning by confidence works well; 
otherwise, we can employ the approximation approaches to achieve low time 
cost. The experimental evaluation demonstrates the performance of proposed 
methods. 

Since this is the first work on discovering matching dependencies, there
are many aspects to develop in the future. For example, although the
current approach can exclude the attributes that are not necessary to an MD,
another issue is to minimize the number of attributes in the MD. However,
the problem of determining attributes for FDs is already hard [19], even though
matching similarity thresholds do not need to be considered there. Moreover,
two different MDs may cover different dependency semantics, which leads us
to the problem of generating a set of MDs. Rather than a single MD, the utility
evaluation of a set of MDs is also interesting. In addition, along the same line
as evaluating FDs [22, 25], the utility of MDs can also
be measured by the smallest number of tuples that would have to be removed
from the relation in order to eliminate all violations. Finally, and most
importantly, more exciting applications of MDs are expected to be explored in
future work.
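For exact FDs, the removal-based measure mentioned above has a simple form: within each group of tuples that agree on $X$, keep the most frequent $Y$ value and remove the rest. The sketch below shows only this FD case as an illustration; how to adapt it to similarity-based MDs is left open here, and the function name and the row representation (dictionaries keyed by attribute name) are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def min_removals_for_fd(rows: Sequence[Dict[str, str]],
                        x_attrs: Sequence[str], y_attrs: Sequence[str]) -> int:
    """Fewest rows to delete so that the FD X -> Y holds on the remainder."""
    groups = defaultdict(Counter)
    for row in rows:
        x_key = tuple(row[a] for a in x_attrs)
        y_key = tuple(row[a] for a in y_attrs)
        groups[x_key][y_key] += 1
    # In each X-group, every row outside the majority Y value is a violation.
    return sum(sum(c.values()) - max(c.values()) for c in groups.values())
```

For instance, on four rows where ZIP "1" maps to city "A" twice and to city "B" once, and ZIP "2" maps to city "C", one removal suffices to make ZIP $\to$ City hold.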